DNA sequencing is the determination of the order of the four nucleotides in a nucleic acid molecule.
First-generation sequencing emerged in the 1970s, when Allan Maxam and Walter Gilbert developed a chemical method for sequencing, followed by Frederick Sanger, who developed the chain-terminator method (also known as Sanger sequencing). Both methods were used in shotgun sequencing, which involves breaking the DNA into fragments and then sequencing the fragments individually. The first step in Sanger sequencing is PCR amplification. The sample DNA is then divided into four reaction tubes, each containing one of the labelled dideoxynucleotide triphosphates: ddATP, ddGTP, ddCTP, or ddTTP. Synthesis terminates at individual nucleotides, producing DNA fragments that are separated by size and identified using gel electrophoresis. The DNA bands are then visualized by autoradiography, and the order of the nucleotide bases in the DNA sequence can be read directly from the X-ray film.
Next-generation sequencing (NGS) is a massively parallel DNA sequencing technology that can analyze hundreds of thousands of DNA fragments simultaneously. Compared to earlier sequencing technology, NGS provides sequencing information on multiple fragments at once. The DNA is first made single-stranded and fluorescently labelled, and then allowed to pair with known libraries/reference genomes. Fluorescent signals detected by the sensors are used to identify the DNA sequence.
Next-Generation Sequencing refers to a class of technologies that sequence millions of short DNA fragments in parallel at relatively low cost. The length and number of the reads differ based on the technology. Currently, there are three major commercially available platforms/technologies for NGS: (1) Illumina, (2) Roche, and (3) Life Technologies. The underlying chemistry and technique used in each platform are unique and affect the output.
NGS sequencing enables a wide variety of applications, allowing researchers to ask virtually any question related to the genome, transcriptome, or epigenome of any organism. NGS methods differ primarily by how the DNA or RNA samples are prepared and the data analysis options used. The technology is used to determine the order of nucleotides in entire genomes or targeted regions of DNA or RNA.
Third-generation sequencing (TGS) is a recent sequencing technology that addresses drawbacks of NGS, such as the need for longer reads and its poor resolution. The foundation of TGS emerged when DNA polymerase was used to obtain the sequence of single DNA molecules, which involves (i) direct imaging of individual DNA molecules using advanced microscopy techniques and (ii) nanopore sequencing, in which a single molecule of DNA is threaded through a nanopore (a nanoscale pore). The DNA’s passage through the pore creates signals that can be converted to read the DNA sequence.
In general, there are two TGS technologies available: (i) Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing, and (ii) Oxford Nanopore Technologies.
WGS is a comprehensive method for analyzing the entire genomic DNA of a cell at one time, using sequencing techniques such as Sanger sequencing, the shotgun approach, or high-throughput NGS. It is also known as full genome sequencing or complete genome sequencing. WGS enables scientists to read the exact sequence of all the letters that make up the complete set of DNA.
WES focuses on the genomic protein coding regions (exons). Although WES requires additional reagents (probes) and some additional steps (hybridization), it is a cost-effective, widely used NGS method that requires fewer sequencing reagents and takes less time to perform bioinformatic analysis compared to WGS. Although the human exome represents only 1-5% of the genome, it contains approximately 85% of known disease-related variants. Despite lengthier sample preparation due to the additional target enrichment step, scientists benefit from quicker sequencing and data analysis compared to WGS.
WES provides greater sequencing depth for researchers interested in identifying genetic variants for numerous applications, including population genetics, genetic disease research, and cancer studies.
The biological results and interpretation of sequencing data for different sequencing applications are greatly affected by the number of sequenced reads that cover the genomic regions.
The sequencing depth measures the average read abundance and it is calculated as the number of bases of all sequenced short reads that match a genome divided by the length of that genome.
The coverage gives the average number of reads that align to known reference bases. If the reads are equal in length, \[ \text{Coverage} = \frac{\text{read length}(bp) \times \text{number of reads}}{\text{genome size}(bp)}\]
If the reads are not equal in length, the coverage is calculated as \[\text{Coverage} = \frac{\sum_{i=1}^n \text{length of read}\ i}{\text{genome size}(bp)}\] where \(n\) is the number of sequenced reads.
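The two coverage formulas above can be sketched in a few lines of Python (an illustrative example, not part of any sequencing toolkit):

```python
def coverage_equal_reads(read_length, num_reads, genome_size):
    """Coverage when all reads share the same length."""
    return read_length * num_reads / genome_size

def coverage_variable_reads(read_lengths, genome_size):
    """Coverage as the sum of individual read lengths over the genome size."""
    return sum(read_lengths) / genome_size

# Example: one million 150 bp reads over a 5 Mb genome give 30x coverage
print(coverage_equal_reads(150, 1_000_000, 5_000_000))  # 30.0
```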
The process of inferring the base at a specific position of the sequenced DNA fragment during the sequencing process is called base calling. Most sequencing platforms are equipped with base-calling programs that assign a Phred quality score to measure the accuracy of each base called. The Phred quality score (\(Q\)-score) transforms the probability of calling a base wrongly into an integer score that is easy to interpret. The Phred score \(Q\) is defined as \[ p = 10^{-Q/10}\] \[ Q = -10 \log_{10} (p)\] where \(p\) is the probability of the base call being wrong.
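The two Phred formulas are easy to check numerically; here is a small Python sketch (illustrative only):

```python
import math

def phred_from_error(p):
    """Phred score Q = -10 * log10(p) for error probability p."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """Error probability p = 10^(-Q/10) for Phred score Q."""
    return 10 ** (-q / 10)

# Q20 means a 1-in-100 chance the base call is wrong,
# Q30 a 1-in-1000 chance (up to floating-point rounding)
print(round(phred_from_error(0.001)))  # 30
print(error_from_phred(20))            # 0.01
```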
The FASTA format was developed as a text-based format to represent nucleotide or protein sequences. An extension of the FASTA format is FASTQ format that is designed to handle base quality metrics output from sequencing machines. In this format, both the sequence and quality scores are represented as single ASCII characters. The format uses four lines for each sequence, and these four lines are stacked on top of each other in text files output by sequencing workflows.
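A minimal Python sketch of reading four-line FASTQ records and decoding a Phred+33 quality string (the record shown is a made-up example):

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality scores) from four-line FASTQ records."""
    for i in range(0, len(lines), 4):
        header, seq, _, qual = lines[i:i + 4]
        # Phred+33: subtracting 33 from the ASCII code gives the quality score
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores

record = ["@read_1", "GATTTGGGG", "+", "!''*((((*"]
name, seq, scores = next(parse_fastq(record))
print(name, scores[:3])  # read_1 [0, 6, 6]
```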
A FastQC report is a quality-control report generated by the FastQC software, which assesses the quality of raw sequence data (typically in FASTQ format) from high-throughput sequencing experiments. It provides a visual overview of potential issues such as low-quality bases, adapter contamination, or nucleotide bias within the reads, allowing researchers to identify problems before further analysis.
Read mapping refers to the alignment of reads from high-throughput sequencing to a reference genome. Basic alignment algorithms do not work here because millions of reads have to be aligned to the reference genome at the same time. Therefore, efficient indexing algorithms are needed to organize the sequence of the reference genome and the short reads in a memory-efficient manner and to facilitate fast pattern searching. The commonly used data structures are:
A suffix tree is used for pattern matching and finding substrings in a given string of characters. It is constructed from key-value pairs, where all possible suffixes of the reference sequence are the keys and their positions (indexes) in the sequence are the values. Ukkonen’s algorithm constructs a suffix tree in linear time.
For example, let us assume that our reference sequence consists of the 10 bases “CTTGGCTGGA” at positions 0, 1, …, 9, with the sentinel $ in the empty trailing position 10. Let us form the suffixes (keys) and indexes (values):
$             10
A$             9
GA$            8
GGA$           7
TGGA$          6
CTGGA$         5
GCTGGA$        4
GGCTGGA$       3
TGGCTGGA$      2
TTGGCTGGA$     1
CTTGGCTGGA$    0
Note that each line consists of a suffix (key) and a position (value). A suffix tree can then be built with the suffix characters (keys) along the edges and the positions (values) stored at the leaf nodes.
Once the suffix tree is built, several search algorithms are available to find the locations where a read maps. For instance, to find “TGG” in the tree, we start by looking for T and then GG. Since the search reaches two leaf nodes, “TGG” can map at positions 2 or 6.
Suffix arrays can be constructed from suffix trees. A suffix array is simply a sorted array of all suffixes of a given sequence. Using it, we can quickly find the locations of the suffixes beginning with “TGG”, namely positions 6 and 2.
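The suffix-array search described above can be sketched in Python. Building the array by naively sorting all suffixes is far slower than the specialized construction algorithms real aligners use, but the idea is the same:

```python
import bisect

def suffix_array(s):
    """Starting positions of all suffixes of s, sorted lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, pattern):
    """Binary search the sorted suffixes for those starting with pattern."""
    suffixes = [s[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "~")  # '~' sorts after $/A/C/G/T
    return sorted(sa[lo:hi])

text = "CTTGGCTGGA$"
sa = suffix_array(text)
print(find_occurrences(text, sa, "TGG"))  # [2, 6]
```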
The Burrows-Wheeler Transform (BWT) is a data structure that transforms a string (e.g., the reference sequence) into a compressible form that allows fast searching. The BWT is used by popular aligners like BWA and Bowtie. The BWT of a sequence \(s\), or \(bwt(s)\), is computed by generating all cyclic rotations of the sequence and sorting them alphabetically to form a matrix called the BWM (Burrows-Wheeler Matrix); the last column of the BWM is the BWT.
To get the \(i\)th character of the BWT of the string \(s\), whose characters are indexed by \(i = 1,2, \ldots n\), from the suffix array \(a\), simply use:
\[bwt(s)[i] = s[a[i]-1]\ \text{if}\ a[i] > 1\ \text{else}\ s[n]\] The BWT can thus be obtained from a suffix array in linear time by using the above transform.
|bwt| A | G | G | $ | G | G | T | T | C | T | C |
The BWT serves two purposes. First, the BWT groups identical characters of the sequence so that a single character appears many times in a row (because the matrix rows are sorted alphabetically), which can be exploited for sequence compression. Second, the BWT data structure can be used to index the reference genome sequence, making position lookup fast.
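Computing the BWT through the suffix array, as in the formula above, takes only a few lines of Python (an illustrative sketch; with 0-indexing, the wrap-around for the first position conveniently falls out of Python's negative indexing):

```python
def bwt(s):
    """Burrows-Wheeler transform of s; s must end with the sentinel '$'."""
    assert s.endswith("$")
    # Suffix array: sorted starting positions of all suffixes (0-indexed)
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Each BWT character precedes its sorted suffix; s[-1] == '$' handles i == 0
    return "".join(s[i - 1] for i in sa)

print(bwt("CTTGGCTGGA$"))  # AGG$GGTTCTC
```

Note that the output is a permutation of the input, one character per row of the BWM.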
De novo genome assembly comes into play when no reference genome is available for the organism. It aims to join reads into a contiguous sequence called a contig. Multiple contigs are joined to form a scaffold, and multiple scaffolds can be linked to form a chromosome. Assembling an entire genome is usually challenging, but with deep sequencing (high coverage and depth), most of the challenges can be overcome.
The algorithms used for de novo genome assembly are:
The greedy algorithm depends on similarity between reads to create a pile-up of aligned sequences, which are collapsed into contigs from the consensus sequences. It uses pairwise alignment to compare all reads and identify the most similar pairs with sufficient overlap to be merged. The process repeats until no more reads can be merged.
The overlap-layout-consensus (OLC) algorithm performs pairwise alignment and then represents the reads and the overlaps between them as a graph, where reads are the nodes and their overlaps are the edges.
Each node \(r_i\) corresponds to a read, and two reads are connected by an edge \(e(r_1, r_2)\) when the suffix of the first read \(r_1\) matches the prefix of the second read \(r_2\). The algorithm then finds a Hamiltonian path of the graph, which visits every node (read) exactly once. Contigs are then created from the consensus sequences of the overlapped suffixes and prefixes.
In de Bruijn graphs, each read is broken into overlapping substrings of length \(k\) called \(k\)-mers. The \(k\)-mers are then represented in a graph: nodes are substrings of length \(k-1\), and each \(k\)-mer contributes an edge connecting its prefix of length \(k-1\) to its suffix of length \(k-1\). The contiguous sequence is recovered by finding an Eulerian path, which includes every edge (\(k\)-mer) exactly once.
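A short Python sketch of the de Bruijn construction (illustrative only; nodes are \((k-1)\)-mers and each \(k\)-mer contributes one edge from its prefix to its suffix):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)

graph = de_bruijn(["CTTGG", "TTGGC"], k=3)
print(graph)  # {'CT': ['TT'], 'TT': ['TG', 'TG'], 'TG': ['GG', 'GG'], 'GG': ['GC']}
```

Walking the edges CT → TT → TG → GG → GC spells out the merged contig “CTTGGC”.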
The DNA-Seq analysis pipeline identifies somatic variants (changes in DNA that occur in cells other than germ cells) within whole-exome sequencing (WXS) and whole-genome sequencing (WGS) data. Somatic variants are identified by comparing allele frequencies in normal and affected sample alignments, annotating each mutation, and aggregating mutations from multiple cases into one project file.
Variant calling is the process by which we identify variants from sequence data:
In a DNA-sequencing pipeline, “filtering” refers to the removal of low-quality reads or alignment artifacts from the raw sequencing data before further analysis. It is typically done by applying quality-score thresholds to eliminate reads with poor base calls, adapter sequences, or potential PCR duplicates, significantly improving the accuracy of downstream variant calling and analysis.
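A hypothetical sketch of one such filter, dropping reads whose mean Phred+33 base quality falls below a threshold (the function names and the threshold are illustrative, not from any specific tool):

```python
def mean_quality(qual_string, offset=33):
    """Mean Phred score of a read, decoded from its Phred+33 quality string."""
    return sum(ord(c) - offset for c in qual_string) / len(qual_string)

def filter_reads(records, min_mean_q=20):
    """Keep (sequence, quality) pairs whose mean base quality meets the threshold."""
    return [(seq, qual) for seq, qual in records if mean_quality(qual) >= min_mean_q]

reads = [("ACGT", "IIII"), ("ACGT", "!!!!")]  # Q40 bases vs. Q0 bases
print(filter_reads(reads))  # [('ACGT', 'IIII')]
```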
Very often, one would want to know if a particular genomic feature falls in a particular genomic region of interest. This task can be handled very well by \(\texttt{GRanges}\) and \(\texttt{SummarizedExperiment}\) objects. We will look at a few ways of creating these objects and a few ways we can manipulate them.
library(GenomicRanges)
library(rtracklayer) # interface to genome annotation files
library(SummarizedExperiment)
Create \(\texttt{GRanges}\) from different files: GFF, BED, and text files.
The General Feature Format (GFF) is a plain text file format used to describe the features of DNA, RNA, and protein sequences. It’s used in bioinformatics to store genomic sequences and annotations. The BED (Browser Extensible Data) format is a text file format used to store genomic regions as coordinates and associated annotations.
get_granges_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}
get_granges_from_bed <- function(file_name){
  bed <- rtracklayer::import.bed(file_name)
  as(bed, "GRanges")
}
get_granges_from_text <- function(file_name){
  df <- readr::read_tsv(file_name, col_names = TRUE)
  GenomicRanges::makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
}
get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}
Get annotated regions from a GFF file:
gr_from_gff <- get_annotated_regions_from_gff(file.path(getwd(), "arabidopsis_chr4.gff"))
head(gr_from_gff)
## GRanges object with 6 ranges and 10 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## [1] Chr4 1-18585056 * | TAIR10 chromosome NA <NA>
## [2] Chr4 1180-1536 - | TAIR10 gene NA <NA>
## [3] Chr4 1180-1536 - | TAIR10 mRNA NA <NA>
## [4] Chr4 1180-1536 - | TAIR10 protein NA <NA>
## [5] Chr4 1180-1536 - | TAIR10 CDS NA 0
## [6] Chr4 1180-1536 - | TAIR10 exon NA <NA>
## ID Name Note
## <character> <character> <CharacterList>
## [1] Chr4 Chr4
## [2] AT4G00005 AT4G00005 protein_coding_gene
## [3] AT4G00005.1 AT4G00005.1
## [4] AT4G00005.1-Protein AT4G00005.1
## [5] <NA> <NA>
## [6] <NA> <NA>
## Parent Index Derives_from
## <CharacterList> <character> <character>
## [1] <NA> <NA>
## [2] <NA> <NA>
## [3] AT4G00005 1 <NA>
## [4] <NA> AT4G00005.1
## [5] AT4G00005.1,AT4G00005.1-Protein <NA> <NA>
## [6] AT4G00005.1 <NA> <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Get genomic regions from text files:
gr_from_txt <- get_granges_from_text(file.path(getwd(), "arabidopsis_chr4.txt"))
head(gr_from_txt)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | type feature_id
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] Chr4 1180-1536 - | gene AT4G00005
## [2] Chr4 2895-10455 - | gene AT4G00020
## [3] Chr4 11355-13359 - | gene AT4G00026
## [4] Chr4 13527-14413 + | gene AT4G00030
## [5] Chr4 14627-16079 + | gene AT4G00040
## [6] Chr4 17792-20066 + | gene AT4G00050
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Extract GRanges of selected gene:
## Extract by seqname or metadata
genes_on_chr4 <- gr_from_gff[ gr_from_gff$type == "gene" & seqnames(gr_from_gff) %in% c("Chr4") ]
head(genes_on_chr4)
## GRanges object with 6 ranges and 10 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## [1] Chr4 1180-1536 - | TAIR10 gene NA <NA>
## [2] Chr4 2895-10455 - | TAIR10 gene NA <NA>
## [3] Chr4 11355-13359 - | TAIR10 gene NA <NA>
## [4] Chr4 13527-14413 + | TAIR10 gene NA <NA>
## [5] Chr4 14627-16079 + | TAIR10 gene NA <NA>
## [6] Chr4 17792-20066 + | TAIR10 gene NA <NA>
## ID Name Note Parent Index
## <character> <character> <CharacterList> <CharacterList> <character>
## [1] AT4G00005 AT4G00005 protein_coding_gene <NA>
## [2] AT4G00020 AT4G00020 protein_coding_gene <NA>
## [3] AT4G00026 AT4G00026 protein_coding_gene <NA>
## [4] AT4G00030 AT4G00030 protein_coding_gene <NA>
## [5] AT4G00040 AT4G00040 protein_coding_gene <NA>
## [6] AT4G00050 AT4G00050 protein_coding_gene <NA>
## Derives_from
## <character>
## [1] <NA>
## [2] <NA>
## [3] <NA>
## [4] <NA>
## [5] <NA>
## [6] <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Manually create a region of interest and read it as a \(\texttt{GRanges}\) object:
## By range, create synthetic ranges
region_of_interest_gr <- GRanges(
  seqnames = c("Chr4"),
  IRanges(c(10000), width = c(1000))
)
region_of_interest_gr
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] Chr4 10000-10999 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Find the overlapping genes of Chr4 in the simulated
region:
overlap_hits <- findOverlaps(region_of_interest_gr, genes_on_chr4)
features_in_region <- genes_on_chr4[subjectHits(overlap_hits) ]
features_in_region
## GRanges object with 1 range and 10 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## [1] Chr4 2895-10455 - | TAIR10 gene NA <NA>
## ID Name Note Parent Index
## <character> <character> <CharacterList> <CharacterList> <character>
## [1] AT4G00020 AT4G00020 protein_coding_gene <NA>
## Derives_from
## <character>
## [1] <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Create a random \(\texttt{SummarizedExperiment}\) that uses \(\texttt{GRanges}\) object and matrix from simulated data.
set.seed(4321)
experiment_counts <- matrix( runif(4308 * 6, 1, 100), 4308) # generate six experiments
sample_names <- c(rep("ctrl",3), rep("test",3) ) # three control and test samples
se <- SummarizedExperiment(rowRanges = gr_from_txt, assays = list(experiment_counts), colData = sample_names)
Let us find the overlapping regions in the simulated experiment data. The \(\texttt{assay}\) function returns the counts in the overlapping regions.
overlap_hits <- findOverlaps(region_of_interest_gr, se)
data_in_region <- se[subjectHits(overlap_hits) ]
assay(data_in_region)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 91.00481 34.41582 42.7602 36.13053 47.6775 19.21672
Often, gene annotations are not available. Let us look at a first-stage pipeline for finding potential genes and genomic loci of interest entirely de novo, without any information beyond the sequence.
Let us use a simple set of rules to find open reading frames (ORFs): sequences that begin with a start codon and end with a stop codon. We will use the \(\texttt{systemPipeR}\) package to find ORFs and create a \(\texttt{GRanges}\) object that can be used for downstream analysis.
library(Biostrings)
library(systemPipeR) # to predict ORF
Read the genomic sequence of the Arabidopsis chloroplast from a FASTA file:
# Use arabidopsis_chloroplast genome sequence as input (from a fasta file)
dna_object <- readDNAStringSet(file.path(getwd(), "arabidopsis_chloroplast.fa"))
Compute the properties of the reference genome:
bases <- c("A", "C", "T", "G")
raw_seq_string <- strsplit(as.character(dna_object), "")
seq_length <- width(dna_object[1])
counts <- letterFrequency(dna_object[1], letters = c("A", "T", "G", "C"))
probs <- unlist(lapply(counts, function(base_count){signif(base_count / seq_length, 2) }))
dna_object[1]
## DNAStringSet object of length 1:
## width seq names
## [1] 154478 ATGGGCGAACGACGGGAATTGAA...TAATAACTTGGTCCCGGGCATC chloroplast
probs
## [1] 0.31 0.32 0.18 0.18
predicted_orfs <- predORF(dna_object, n='all', type='gr', mode='ORF', strand = 'both',
longest_disjoint = TRUE)
predicted_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 1162 chloroplast 2056-2532 - | 1 3
## 2 chloroplast 72371-73897 + | 2 2
## 1163 chloroplast 77901-78362 - | 2 1
## 3 chloroplast 54937-56397 + | 3 3
## ... ... ... ... . ... ...
## 2497 chloroplast 129757-129762 - | 1336 3
## 2498 chloroplast 139258-139263 - | 1337 3
## 2499 chloroplast 140026-140031 - | 1338 3
## 2500 chloroplast 143947-143952 - | 1339 3
## 2501 chloroplast 153619-153624 - | 1340 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Find the length of the longest ORF in random genomes, and keep only those predicted ORFs in the given genome that are longer:
# write a function to return longest ORF
get_longest_orf_in_random_genome <- function(x,
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "C", "T", "G")) {
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- c("random_dna_string")
  orfs <- predORF(random_dna_object, n = 1, type = 'gr', mode = 'ORF', strand = 'both',
                  longest_disjoint = TRUE)
  return(max(width(orfs)))
}
# generate 10 simulated random genomes
random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length,
probs = probs, bases = bases))
# get length of longest random ORF
longest_random_orf <- max(random_lengths)
# Keep only the predicted ORF longer than the longest random ORF
keep <- width(predicted_orfs) > longest_random_orf
orfs_to_keep <- predicted_orfs[keep]
orfs_to_keep
## GRanges object with 10 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 2 chloroplast 72371-73897 + | 2 2
## 3 chloroplast 54937-56397 + | 3 3
## 4 chloroplast 57147-58541 + | 4 1
## 5 chloroplast 33918-35141 + | 5 1
## 6 chloroplast 32693-33772 + | 6 2
## 7 chloroplast 109408-110436 + | 7 3
## 8 chloroplast 114461-115447 + | 8 2
## 9 chloroplast 141539-142276 + | 9 2
## 10 chloroplast 60741-61430 + | 10 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Write the results to a file in FASTA format
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
Genomic variation refers to DNA sequence differences among individuals or populations.
Single-nucleotide variants (SNVs), also known as single-base substitutions, are the simplest type of variation, as they involve only the change of one base for another in a DNA sequence. They can be subcategorized into transitions (Ti) and transversions (Tv); transitions are changes between two purines or between two pyrimidines, whereas transversions involve a change from a purine to a pyrimidine or vice versa.
If the SNV is common in a population (usually with an allele frequency > 1%), then it is referred to as a SNP (single nucleotide polymorphism).
Multi-nucleotide variants (MNVs) are sequence variants that involve the consecutive change of two or more bases. Similarly to SNVs, some MNVs are found at higher frequencies in the population and are referred to as MNPs (multi-nucleotide polymorphisms).
Insertions and deletions (indels) involve the gain or loss of one or more bases in a sequence. Usually, what is referred to as an indel tends to be only a few bases in length.
Structural variants (SVs) are genomic variations that involve larger segments of the genome:
There is not a strict rule defining the number of base pairs that make the difference between an indel and a structural variant, but usually, a gain or loss of DNA would be called a structural variant if it involved more than one kilobase of sequence.
A naive approach to determining genotypes from a pileup of sequencing reads mapped to a site in the genome is to count the number of reads supporting the alternate allele against the reference and then apply hard genotype thresholds.
In this method, reads are aligned to the reference sequence (green) and a threshold of the proportion of reads supporting each allele for calling genotypes is established (top). Then, at each position, the proportion of reads supporting the alternative allele is calculated and, based on the dosage of the alternative allele, a genotype is established. Yellow: a position where a variant is present but the proportion of alternative alleles does not reach the threshold (1/6 < 0.25).
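The hard-threshold scheme can be sketched as follows (the 0.25/0.75 cut-offs are illustrative, chosen to match the 0.25 threshold in the description above):

```python
def call_genotype(ref_count, alt_count, het_min=0.25, hom_min=0.75):
    """Naive genotype call from the fraction of reads supporting the alternative allele."""
    alt_fraction = alt_count / (ref_count + alt_count)
    if alt_fraction >= hom_min:
        return "alt/alt"
    if alt_fraction >= het_min:
        return "ref/alt"
    return "ref/ref"

print(call_genotype(5, 1))  # 1/6 < 0.25, so the variant is not called: ref/ref
print(call_genotype(3, 3))  # ref/alt
print(call_genotype(0, 6))  # alt/alt
```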
The Bayesian approach includes information about the prior probability of a variant occurring at that site and the amount of information supporting each of the potential genotypes. The posterior probability of a genotype \(g\) given genomic data \(d\) is given by
\[ P(g | d) = \frac{P(d|g)P(g)}{P(d)}\] where \(P(d|g)\) is the likelihood of the sequence data \(d\) given the genotype and \(P(g)\) is the prior probability of genotype \(g\). The prior probabilities of a particular genotype can be estimated over the whole sequence. The likelihood can be generated from the alignment of sequences: for every combination of reference allele (site type) and nucleotide in every read, the probability of the observed allele being the same as the reference is calculated, and these probabilities are then multiplied over all nucleotides in the reads at that position.
\(P(d)\) can be computed as \[ P(d) = \sum_{g^\prime} P(d|g^\prime) P(g^\prime)\]
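Putting the three formulas together, here is a toy Python example of a Bayesian genotype call (the likelihood and prior values are made-up numbers for illustration):

```python
def genotype_posteriors(likelihoods, priors):
    """Posterior P(g|d) = P(d|g) P(g) / P(d), with P(d) summed over genotypes."""
    evidence = sum(likelihoods[g] * priors[g] for g in priors)
    return {g: likelihoods[g] * priors[g] / evidence for g in priors}

# Hypothetical pileup likelihoods under each genotype, with a prior that
# strongly favors the homozygous-reference genotype
likelihoods = {"ref/ref": 1e-6, "ref/alt": 1e-2, "alt/alt": 1e-5}
priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}
posteriors = genotype_posteriors(likelihoods, priors)
print(max(posteriors, key=posteriors.get))  # ref/alt
```

Even with a strong reference prior, the data dominate here: the heterozygous likelihood is four orders of magnitude larger than the reference one.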
Heuristic variant callers rely on heuristic quantities and algorithms to call a variant site, such as minimum coverage and alignment-quality thresholds, and stringent cut-offs like a minimum number of reads supporting a variant allele.
A key task in sequence analysis is to take an alignment of high-throughput sequences stored in a BAM (Binary Alignment Map) file and compute a list of variant positions. Note that after the FASTQ file is aligned to the reference genome, the output appears as a BAM file. The results of variant calling are usually stored in a VCF file of variants.
library(GenomicRanges) # handles genomic locations within a genome
library(gmapR) # align short-range data
library(rtracklayer) #interface to genome annotation files and the UCSC genome browser
library(VariantAnnotation)
library(VariantTools)
We will use a set of synthetic reads from human genome chromosome 17. Let us first read \(\texttt{.bam}\) and \(\texttt{.fa}\) files:
# get the directory of the bam files folder
bam_folder <- file.path(getwd())
# aligned reads
bam_file <- file.path(bam_folder, "hg17_snps.bam")
# reference genome
fasta_file <- file.path(bam_folder, "chr17.83k.fa")
Create a \(\texttt{GmapGenome}\) object
fa <- rtracklayer::FastaFile(fasta_file)
genome <- gmapR::GmapGenome(fa, create=TRUE)
Create a parameter object and call the variants:
qual_params <- TallyVariantsParam(genome = genome, minimum_mapq = 20) # min Q
var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)
called_variants <- callVariants(bam_file, qual_params, calling.filters = var_params)
head(called_variants)
## VRanges object with 6 ranges and 17 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] NC_000017.10 64 * G T 759
## [2] NC_000017.10 69 * G T 812
## [3] NC_000017.10 70 * G T 818
## [4] NC_000017.10 73 * T A 814
## [5] NC_000017.10 77 * T A 802
## [6] NC_000017.10 78 * G T 798
## refDepth altDepth sampleNames softFilterMatrix | n.read.pos
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <integer>
## [1] 739 20 <NA> | 17
## [2] 790 22 <NA> | 19
## [3] 796 22 <NA> | 20
## [4] 795 19 <NA> | 13
## [5] 780 22 <NA> | 19
## [6] 777 21 <NA> | 17
## n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
## <integer> <integer> <integer> <integer> <integer>
## [1] 64 759 20 739 0
## [2] 69 812 22 790 0
## [3] 70 818 22 796 0
## [4] 70 814 19 795 0
## [5] 70 802 22 780 0
## [6] 70 798 21 777 0
## count.minus.ref count.del.plus count.del.minus read.pos.mean
## <integer> <integer> <integer> <numeric>
## [1] 0 0 0 30.9000
## [2] 0 0 0 40.7273
## [3] 0 0 0 34.7727
## [4] 0 0 0 36.1579
## [5] 0 0 0 38.3636
## [6] 0 0 0 39.7143
## read.pos.mean.ref read.pos.var read.pos.var.ref mdfne mdfne.ref
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 32.8755 318.558 347.804 NA NA
## [2] 35.4190 377.004 398.876 NA NA
## [3] 36.3442 497.762 402.360 NA NA
## [4] 36.2176 519.551 402.843 NA NA
## [5] 36.0064 472.327 397.070 NA NA
## [6] 35.9241 609.076 390.463 NA NA
## count.high.nm count.high.nm.ref
## <integer> <integer>
## [1] 20 738
## [2] 22 789
## [3] 22 796
## [4] 19 769
## [5] 22 780
## [6] 21 777
## -------
## seqinfo: 1 sequence from chr17.83k genome
## hardFilters(4): nonRef nonNRef readCount likelihoodRatio
Write VCF file
VariantAnnotation::sampleNames(called_variants) <- "sample_name"
vcf <- VariantAnnotation::asVCF(called_variants)
VariantAnnotation::writeVcf(vcf, "hg17.vcf")
Load the annotations and feature positions
get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}
get_annotated_regions_from_bed <- function(file_name){
  bed <- rtracklayer::import.bed(file_name)
  as(bed, "GRanges")
}
Get the gene annotations of the test genome:
genes <- get_annotated_regions_from_gff(file.path( bam_folder, "chr17.83k.gff3"))
Calculate which variants overlap genes, and subset those genes:
overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
genes[subjectHits(overlaps)][1:6]
## GRanges object with 6 ranges and 20 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## [1] NC_000017.10 64099-76866 - | havana ncRNA_gene NA <NA>
## [2] NC_000017.10 64099-76866 - | havana lnc_RNA NA <NA>
## [3] NC_000017.10 64099-65736 - | havana exon NA <NA>
## [4] NC_000017.10 64099-76866 - | havana ncRNA_gene NA <NA>
## [5] NC_000017.10 64099-76866 - | havana lnc_RNA NA <NA>
## [6] NC_000017.10 64099-65736 - | havana exon NA <NA>
## ID Name biotype description
## <character> <character> <character> <character>
## [1] gene:ENSG00000280279 AC240565.2 lincRNA novel transcript
## [2] transcript:ENST00000.. AC240565.2-201 lincRNA <NA>
## [3] <NA> ENSE00003759547 <NA> <NA>
## [4] gene:ENSG00000280279 AC240565.2 lincRNA novel transcript
## [5] transcript:ENST00000.. AC240565.2-201 lincRNA <NA>
## [6] <NA> ENSE00003759547 <NA> <NA>
## gene_id logic_name version Parent
## <character> <character> <character> <CharacterList>
## [1] ENSG00000280279 havana 1
## [2] <NA> <NA> 1 gene:ENSG00000280279
## [3] <NA> <NA> 1 transcript:ENST00000..
## [4] ENSG00000280279 havana 1
## [5] <NA> <NA> 1 gene:ENSG00000280279
## [6] <NA> <NA> 1 transcript:ENST00000..
## tag transcript_id transcript_support_level constitutive
## <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] basic ENST00000623180 5 <NA>
## [3] <NA> <NA> <NA> 1
## [4] <NA> <NA> <NA> <NA>
## [5] basic ENST00000623180 5 <NA>
## [6] <NA> <NA> <NA> 1
## ensembl_end_phase ensembl_phase exon_id rank
## <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] <NA> <NA> <NA> <NA>
## [3] -1 -1 ENSE00003759547 5
## [4] <NA> <NA> <NA> <NA>
## [5] <NA> <NA> <NA> <NA>
## [6] -1 -1 ENSE00003759547 5
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Often, we want to see where certain features of interest lie on a chromosome or genetic map in relation to others. Such plots are called chromosome plots or ideograms, and we will see how \(\texttt{karyoploteR}\) produces them.
library(karyoploteR)
library(GenomicRanges)
Create a \(\texttt{GRanges}\) object: five genomic ranges on chromosomes 1-5:
genome_df <- data.frame(
  chr = paste0("chr", 1:5),
  start = rep(1, 5),
  end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)
genome_gr <- makeGRangesFromDataFrame(genome_df)
genome_gr
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-34964571 *
## [2] chr2 1-22037565 *
## [3] chr3 1-25499034 *
## [4] chr4 1-20862711 *
## [5] chr5 1-31270811 *
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
Set up SNP positions to draw as markers:
snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
chr = paste0("chr", sample(1:5,25, replace=TRUE)),
start = snp_pos,
end = snp_pos
)
snps_gr <- makeGRangesFromDataFrame(snps)
snp_labels <- paste0("snp_", 1:25)
snps_gr
## GRanges object with 25 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr5 5731177 *
## [2] chr3 6855485 *
## [3] chr5 1445614 *
## [4] chr1 9442378 *
## [5] chr4 3018935 *
## ... ... ... ...
## [21] chr4 7395203 *
## [22] chr3 3367345 *
## [23] chr4 185686 *
## [24] chr4 3180060 *
## [25] chr1 9916951 *
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
Plot SNP locations:
plot.params <- getDefaultPlotParams(plot.type=1)
plot.params$data1outmargin <- 600
kp <- plotKaryotype(genome=genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
We can add numeric data to the plot if they are available. For example, let's generate some random data for chr4:
### create random data for chr4
numeric_data <- data.frame(
y = rnorm(100,mean = 1,sd = 0.5 ),
chr = rep("chr4", 100),
start = seq(1,20862711, 20862711/100),
end = seq(1,20862711, 20862711/100)
)
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
plot.params <- getDefaultPlotParams(plot.type=2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800
kp <- plotKaryotype(genome=genome_gr, plot.type = 2, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel=2)
We often want to know whether a locus has been duplicated or its copy number has increased. Our approach is to use DNA-seq read coverage after alignment: estimate a background level of coverage, then inspect the coverage in the region of interest. The ratio of the two gives an estimate of the copy number of the region.
library(csaw) # Bioconductor package for finding CNVs
Get counts in windows across the hg17 genome:
whole_genome <- csaw::windowCounts(
file.path(getwd(), "hg17_snps.bam"), # the bam file contains counts
bin = TRUE,
filter = 0,
width = 100, # the width of the window
param = csaw::readParam(
minq = 20, # minimum mapping quality
dedup = TRUE,
pe = "both" # paired-end or single-end or both
)
)
colnames(whole_genome) <- c("h17")
Extract the data from the \(\texttt{SummarizedExperiment}\) object, set a threshold, and set counts below it to NA.
counts <- assay(whole_genome)[,1]
min_count <- quantile(counts, 0.1)[[1]]
counts[counts < min_count] <- NA # lowest counts to NA
First, double the counts in a stretch of windows to simulate a duplicated region. Then calculate the mean coverage and the ratio of each window's coverage to that mean, and inspect the ratio vector with a plot.
n <- length(counts)
doubled_windows <- 10
left_pad <- floor((n/2) - doubled_windows)
right_pad <- n - left_pad - doubled_windows
multiplier <- c(rep(1, left_pad), rep(2, doubled_windows), rep(1, right_pad))
counts <- counts * multiplier
mean_cov <- mean(counts, na.rm=TRUE)
ratio <- matrix(log2(counts / mean_cov), ncol = 1)
plot(ratio)
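As a self-contained illustration of the same idea (using simulated Poisson counts rather than the BAM-derived counts above), the log2 ratio against the mean can flag candidate duplicated windows directly:

```r
# Simulated example (not the chapter's data): flag windows whose log2
# coverage ratio against the mean suggests a duplication.
set.seed(7)
counts <- rpois(100, lambda = 50)   # background coverage in 100 windows
counts[45:54] <- counts[45:54] * 2  # simulate a doubled region
ratio <- log2(counts / mean(counts))
which(ratio > 0.5)                  # candidate duplicated windows
```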
Build a \(\texttt{SummarizedExperiment}\) with the new data
se <- SummarizedExperiment(assays=list(ratio), rowRanges= rowRanges(whole_genome),
colData = c("CoverageRatio"))
Create a region of interest and extract the coverage from it
region_of_interest <- GRanges(
seqnames = c("NC_000017.10"),
IRanges(c(40700), width = c(1500) )
)
overlap_hits <- findOverlaps(region_of_interest, se)
data_in_region <- se[subjectHits(overlap_hits)]
assay(data_in_region)
## [,1]
## [1,] 0.01725283
## [2,] 0.03128239
## [3,] -0.05748994
## [4,] 0.05893873
## [5,] 0.94251006
## [6,] 0.88186246
## [7,] 0.87927929
## [8,] 0.63780103
## [9,] 1.00308550
## [10,] 0.75515798
## [11,] 0.80228189
## [12,] 1.05207419
## [13,] 0.82393626
## [14,] NA
## [15,] NA
## [16,] -0.16269298
In the output, we see the region has a \(\log_2\) ratio of roughly 1 (two-fold) relative to the background, which we can interpret as a doubling of the copy number.
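The conversion from log2 ratio back to fold change is simply \(2^{ratio}\); a quick sanity check in base R:

```r
# A log2 ratio of 0 corresponds to background coverage (fold change 1),
# and a log2 ratio of 1 to a two-fold increase.
log2_ratio <- c(0, 1)
fold_change <- 2^log2_ratio
fold_change   # 1 2
```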
Genome-wide association studies (GWAS) test for the presence of genetic variants associated with phenotypes across many samples of high-throughput sequencing data. A GWAS is a genomic analysis of genetic variants in different individuals to see whether any particular variant is associated with a trait in a large population.
There are various approaches to GWAS, which rely on gathering data on variants in particular samples and working out each sample's genotype before cross-referencing it with the phenotype in some way. We will use the GWAS function from the rrBLUP package.
The goal of GWAS is to test for association between the frequency of each of hundreds of thousands of common variants and a given phenotype, identify variants that exceed a conservative genome-wide significance threshold, and then test these for evidence of replication.
Linear regression models for GWAS can be written as follows: \[Y = W\alpha + X\beta + g + e\]
where, for each individual, \(Y\) is a vector of phenotype values, \(W\) is a matrix of covariates including an intercept term, \(\alpha\) is the corresponding vector of covariate effect sizes, \(X\) is a vector of genotype values for all individuals at a SNP, \(\beta\) is the corresponding fixed effect size of the genetic variant (also known as the SNP effect size), \(g\) is a random effect that captures the polygenic effect of other SNPs, and \(e\) is a random effect of residual errors.
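To make the model concrete, here is a simplified, fixed-effects-only sketch on simulated data. It omits the polygenic random effect \(g\), which is what mixed-model tools such as \(\texttt{rrBLUP}\) estimate; all names and values here are illustrative:

```r
set.seed(1)
n   <- 200
sex <- rbinom(n, 1, 0.5)                 # covariate: a column of W
snp <- rbinom(n, 2, 0.3)                 # genotype at one SNP (X), coded 0/1/2
y   <- 0.5 * sex + 0.4 * snp + rnorm(n)  # phenotype Y with a true SNP effect
fit <- lm(y ~ sex + snp)                 # Y = W*alpha + X*beta + e (g omitted)
beta_hat <- coef(fit)["snp"]             # estimate of the SNP effect beta
pval <- summary(fit)$coefficients["snp", "Pr(>|t|)"]
```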
library(VariantAnnotation)
library(rrBLUP)
set.seed(1234) # for reproducibility
Get the VCF (Variant Call Format) file
vcf <- readVcf("small_sample.vcf", "hg19")
header(vcf)
## class: VCFHeader
## samples(3): NA00001 NA00002 NA00003
## meta(3): fileformat reference contig
## fixed(1): FILTER
## info(3): DP AF DB
## geno(2): GT DP
Extract the genotype, sample, and marker position information
samples <- samples(header(vcf)) # sample names
samples
## [1] "NA00001" "NA00002" "NA00003"
chrom <- as.character(seqnames(rowRanges(vcf)))
chrom
## [1] "20" "20" "20"
pos <- as.numeric(start(rowRanges(vcf)))
pos
## [1] 14370 17330 1230237
gts <- geno(vcf)$GT
gts
## NA00001 NA00002 NA00003
## rs6054257 "0/0" "0/1" "1/1"
## 20:17330_T/A "0/0" "0/1" "0/0"
## 20:1230237_T/G "0/0" "0/0" "1/0"
markers <- rownames(gts)
markers
## [1] "rs6054257" "20:17330_T/A" "20:1230237_T/G"
Convert VCF genotypes into the convention used by the GWAS function:
convert <- function(v){
  v <- gsub("0/0",  1, v) # homozygous reference
  v <- gsub("0/1",  0, v) # heterozygous
  v <- gsub("1/0",  0, v) # heterozygous
  v <- gsub("1/1", -1, v) # homozygous alternate
  return(v)
}
Apply the function, then convert the result into a numeric matrix in the heterozygous/homozygous coding expected by \(\texttt{GWAS}\).
gt_char <- apply(gts, 2, convert)
genotype_matrix <- matrix(as.numeric(gt_char), nrow(gt_char))
colnames(genotype_matrix) <- samples
genotype_matrix
## NA00001 NA00002 NA00003
## [1,] 1 0 -1
## [2,] 1 0 1
## [3,] 1 1 0
Build a dataframe describing the variants
variant_info <- data.frame(marker = markers, chrom = chrom, pos = pos)
variant_info
## marker chrom pos
## 1 rs6054257 20 14370
## 2 20:17330_T/A 20 17330
## 3 20:1230237_T/G 20 1230237
Build a variant/genotype dataframe
genotypes <- cbind(variant_info, as.data.frame(genotype_matrix))
genotypes
## marker chrom pos NA00001 NA00002 NA00003
## 1 rs6054257 20 14370 1 0 -1
## 2 20:17330_T/A 20 17330 1 0 1
## 3 20:1230237_T/G 20 1230237 1 1 0
Build a phenotype dataframe by assigning random values
phenotypes <- data.frame(
line = samples,
score = rnorm(length(samples)) # generate random numbers as phenotypes
)
phenotypes
## line score
## 1 NA00001 -1.2070657
## 2 NA00002 0.2774292
## 3 NA00003 1.0844412
Run GWAS to test for genotype-phenotype associations:
GWAS(phenotypes, genotypes, plot=FALSE)
## [1] "GWAS for trait: score"
## [1] "Variance components estimated. Testing markers."
## marker chrom pos score
## 1 rs6054257 20 14370 0.3010543
## 2 20:17330_T/A 20 17330 0.3010057
## 3 20:1230237_T/G 20 1230237 0.1655498
This returns a data frame in which the first three columns are the marker name, chromosome, and position, and subsequent columns are the marker scores (\(-\log_{10}\) \(p\)-values) for the traits.
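Since the reported scores are \(-\log_{10}\) \(p\)-values, the raw \(p\)-values can be recovered by inverting the transform:

```r
# Scores taken from the GWAS output above; p = 10^(-score).
scores <- c(0.3010543, 0.3010057, 0.1655498)
pvals  <- 10^(-scores)
round(pvals, 3)   # roughly 0.500 0.500 0.683
```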
Here a fuller GWAS analysis is carried out using the R package \(\texttt{rrBLUP}\).
library("rrBLUP")
Load the example data. The simulated data used here are modified from a wheat dataset: 932 phenotypic yield observations and genotypic data on 329 lines with 3629 mapped markers.
load("GWAS_Data.RData") # load "pheno", "geno", and "map"
dim(pheno)
## [1] 932 3
head(pheno)### View phenotypic data.
## GID ENV Yield
## 1 Oat179 Env1 6317.606
## 2 Oat130 Env2 6335.475
## 3 Oat303 Env4 7259.274
## 4 Oat270 Env1 6916.124
## 5 Oat202 Env4 6845.943
## 6 Oat233 Env3 5750.001
str(pheno) ### Display variable structure. GID and ENV need to be factors.
## 'data.frame': 932 obs. of 3 variables:
## $ GID : Factor w/ 330 levels "Oat1","Oat10",..: 89 36 228 191 116 150 205 25 153 264 ...
## $ ENV : Factor w/ 4 levels "Env1","Env2",..: 1 2 4 1 4 3 2 2 3 2 ...
## $ Yield: num 6318 6335 7259 6916 6846 ...
The genotypic data contain 329 lines and 3629 markers. The map file contains 3354 markers and three variables.
dim(geno)
## [1] 329 3629
geno[1:5,1:5] ### View genotypic data.
## Marker1 Marker2 Marker3 Marker4 Marker5
## Oat1 NA 1 1 1 1
## Oat2 -1 NA 1 1 1
## Oat3 -1 1 1 1 NA
## Oat4 -1 NA -1 1 1
## Oat5 -1 NA -1 1 1
map[1:5,1:3] ### View Map data.
## Markers chrom loc
## 1 Marker607 1A 16730
## 2 Marker2900 1A 16730
## 3 Marker1316 1A 25338
## 4 Marker2297 1A 26595
## 5 Marker1895 1A 27071
Check the normality of the data. In theory the residuals need to be checked, but in general if the data are normal, the residuals are normal too. The data look fairly normal, with some outliers.
hist(pheno$Yield, col="black",xlab="Yield",ylab="Frequency",
border="white",breaks=10,main="Yield Histogram")
The Shapiro-Wilk test indicates that the normality condition is not met.
shapiro.test(pheno$Yield)
##
## Shapiro-Wilk normality test
##
## data: pheno$Yield
## W = 0.99392, p-value = 0.0007821
Let's remove the outliers and check normality again.
boxplot.yield<- boxplot(pheno$Yield,xlab="Boxplot",ylab="Yield",ylim=c(4000,9000))
outliers <- boxplot.yield$out
outliers #10 outliers
## [1] 8979.334 3944.849 3947.001 3948.641 4576.506 3942.469 8166.286 4616.835
## [9] 4672.310 8920.055
pheno <- pheno[-which(pheno$Yield %in% outliers),] # removing outliers
shapiro.test(pheno$Yield) # after removing outliers, the Shapiro-Wilk test indicates normality
##
## Shapiro-Wilk normality test
##
## data: pheno$Yield
## W = 0.99719, p-value = 0.1093
pheno <- na.omit(pheno) ## to remove any remaining missing data
In this simple function, we remove individuals with missing data, markers above a certain percentage of missing calls, and markers with excess heterozygous calls (if there is a high proportion of heterozygous calls, it can indicate a problem with the marker, because oat is an inbred species).
filter.fun <- function(geno,IM,MM,H){
#Remove individuals with more than a % missing data
individual.missing <- apply(geno,1,function(x){
return(length(which(is.na(x)))/ncol(geno))
})
#length(which(individual.missing>0.40)) will tell you how many
#individuals would be removed at 40% missing.
#Remove markers with % missing data
marker.missing <- apply(geno,2,function(x)
{return(length(which(is.na(x)))/nrow(geno))
})
length(which(marker.missing>0.6))
#Remove markers with more than a % heterozygous calls.
heteroz <- apply(geno,1,function(x){
return(length(which(x==0))/length(!is.na(x)))
})
filter1 <- geno[which(individual.missing<IM), which(marker.missing<MM)]
filter2 <- filter1[,(heteroz<H)]
return(filter2)
}
geno[1:10,1:10]
## Marker1 Marker2 Marker3 Marker4 Marker5 Marker6 Marker7 Marker8 Marker9
## Oat1 NA 1 1 1 1 1 -1 -1 -1
## Oat2 -1 NA 1 1 1 1 -1 -1 -1
## Oat3 -1 1 1 1 NA 1 -1 NA -1
## Oat4 -1 NA -1 1 1 1 -1 -1 1
## Oat5 -1 NA -1 1 1 NA -1 -1 1
## Oat6 -1 1 1 NA -1 NA -1 1 -1
## Oat7 NA 1 -1 -1 -1 NA 1 1 -1
## Oat8 -1 NA -1 -1 NA NA NA -1 1
## Oat9 -1 1 1 -1 -1 1 -1 -1 -1
## Oat10 -1 NA 1 1 1 NA -1 1 -1
## Marker10
## Oat1 NA
## Oat2 1
## Oat3 1
## Oat4 NA
## Oat5 NA
## Oat6 NA
## Oat7 NA
## Oat8 NA
## Oat9 NA
## Oat10 NA
geno.filtered <- filter.fun(geno[,1:3629],0.4,0.60,0.02)
geno.filtered[1:5,1:5];dim(geno.filtered)
## Marker1 Marker2 Marker3 Marker4 Marker5
## Oat1 NA 1 1 1 1
## Oat2 -1 NA 1 1 1
## Oat3 -1 1 1 1 NA
## Oat4 -1 NA -1 1 1
## Oat5 -1 NA -1 1 1
## [1] 328 3268
The main idea behind imputation is to predict (or 'impute') the missing data from the observed data. Here, the A.mat function from the 'rrBLUP' package performs the imputation, using either column means or the EM algorithm; below we use the EM algorithm. rrBLUP also allows removing markers by minor allele frequency (MAF); in our example we remove those markers with MAF less than 0.05.
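For intuition, here is a base-R sketch of the simpler per-marker mean imputation (conceptually what the "mean" imputation option does; the toy matrix below is made up for illustration):

```r
# Toy genotype matrix with missing calls; impute each marker (column)
# with its observed mean.
g <- matrix(c(-1, NA,  1,
               1,  1, NA,
              NA, -1, -1), nrow = 3, byrow = TRUE)
impute_mean <- function(m) {
  apply(m, 2, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
}
impute_mean(g)   # every NA replaced by its column mean (here, 0)
```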
Imputation <- A.mat(geno.filtered,impute.method="EM",return.imputed=T,min.MAF=0.05)
## [1] "A.mat converging:"
## [1] 0.0431
## [1] 0.00647
K.mat <- Imputation$A ### KINSHIP matrix
K.mat[1:5,1:5] ## view Kinship
## Oat1 Oat2 Oat3 Oat4 Oat5
## Oat1 1.8191561 0.220454719 0.1009423 0.15218203 0.177660657
## Oat2 0.2204547 2.111943356 0.1266988 0.02978199 -0.007214392
## Oat3 0.1009423 0.126698834 1.8000414 -0.12678093 -0.126589635
## Oat4 0.1521820 0.029781989 -0.1267809 1.87520005 1.803271696
## Oat5 0.1776607 -0.007214392 -0.1265896 1.80327170 1.873595678
geno.gwas <- Imputation$imputed #NEW geno data.
geno.gwas[1:5,1:5] ## view geno
## Marker1 Marker3 Marker4 Marker5 Marker7
## Oat1 -1.44066 1 1 1.0000000 -1
## Oat2 -1.00000 1 1 1.0000000 -1
## Oat3 -1.00000 1 1 0.6753843 -1
## Oat4 -1.00000 -1 1 1.0000000 -1
## Oat5 -1.00000 -1 1 1.0000000 -1
One crucial step in a GWAS analysis is to study the population structure. The main reason is that, as a consequence of different population genetic histories, distinct subpopulations can differ in allele frequencies at many polymorphisms throughout the genome.
## Principal components analysis
geno.scale <- scale(geno.gwas,center=T,scale=F) # data need to be centered
svdgeno <- svd(geno.scale)
PCA <- geno.scale%*%svdgeno$v #Principal components
PCA[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## Oat1 -6.976549 -9.824339 5.290272 -1.184019 5.00415366
## Oat2 -8.745900 -10.545057 4.730541 -8.834242 -0.02504027
## Oat3 -6.292199 3.412671 4.774038 6.657838 8.93925754
## Oat4 -19.524147 7.689386 -7.963162 -13.644082 -6.19429306
## Oat5 -19.334593 8.011715 -7.624697 -13.033030 -6.53379527
## Screeplot to visualize the proportion of variance explained by PCA
plot(round((svdgeno$d)^2/sum((svdgeno$d)^2),d=7)[1:10],type="o",main="Screeplot",
xlab="PCAs",ylab="% variance")
##Proportion of variance explained by PCA1 and PCA2
PCA1 <- 100*round((svdgeno$d[1])^2/sum((svdgeno$d)^2),d=3); PCA1
## [1] 8.6
PCA2 <- 100*round((svdgeno$d[2])^2/sum((svdgeno$d)^2),d=3); PCA2
## [1] 5.2
### Plotting Principal components.
plot(PCA[,1],PCA[,2],xlab=paste("Pcomp:",PCA1,"%",sep=""),ylab=paste("Pcomp:",PCA2,"%",sep=""),
pch=20,cex=0.7)
If there is a hint of population substructure, you can check by combining the PCA with clustering:
### Plotting depending on clustering.
Eucl <- dist(geno.gwas) ### Euclidean distance
fit <- hclust(Eucl,method="ward.D2") ### Ward criterion tends to give compact clusters of similar size.
groups2 <- cutree(fit,k=2) ### Selecting two clusters.
table(groups2)# Number of individuals per cluster.
## groups2
## 1 2
## 251 77
plot(PCA[,1],PCA[,2],xlab=paste("Pcomp:",PCA1,"%",sep=""),ylab=paste("Pcomp:",PCA2,"%",sep=""),
pch=20,cex=0.7,col=groups2)
legend("bottomright",c("Subpop1: 251","Subpop2: 77"),pch=20,col=(c("black","red")),
lty=0,bty="n",cex=1)
The following code prepares the files for the GWAS function. To run the analysis, the same genotypes need to be present in both the phenotype and genotype files, so the files must be matched against each other.
pheno=pheno[pheno$GID%in%rownames(geno.gwas),]
pheno$GID <- factor(as.character(pheno$GID), levels=rownames(geno.gwas)) # to ensure the same levels in both files
pheno <- pheno[order(pheno$GID),]
##Creating file for GWAS function from rrBLUP package
X<-model.matrix(~-1+ENV, data=pheno)
pheno.gwas <- data.frame(GID=pheno$GID, X, Yield=pheno$Yield)
head(pheno.gwas)
## GID ENVEnv1 ENVEnv2 ENVEnv3 ENVEnv4 Yield
## 77 Oat1 0 0 1 0 6126.514
## 81 Oat1 1 0 0 0 6613.542
## 330 Oat1 0 0 0 1 5984.449
## 417 Oat1 0 1 0 0 6855.848
## 441 Oat2 0 1 0 0 5510.549
## 442 Oat2 1 0 0 0 5711.477
# after imputation, we have to align genotype and phenotype data again
geno.gwas <- geno.gwas[rownames(geno.gwas)%in%pheno.gwas$GID,]
pheno.gwas <- pheno.gwas[pheno.gwas$GID%in%rownames(geno.gwas),]
geno.gwas <- geno.gwas[rownames(geno.gwas)%in%rownames(K.mat),]
K.mat <- K.mat[rownames(K.mat)%in%rownames(geno.gwas),colnames(K.mat)%in%rownames(geno.gwas)]
pheno.gwas <- pheno.gwas[pheno.gwas$GID%in%rownames(K.mat),]
# we need to do the same for 'map' data
geno.gwas<-geno.gwas[,match(map$Markers,colnames(geno.gwas))]
geno.gwas <- geno.gwas[,colnames(geno.gwas)%in%map$Markers]
map <- map[map$Markers%in%colnames(geno.gwas),]
head(map)
## Markers chrom loc
## 4 Marker2297 1A 26595
## 9 Marker3125 1A 35232
## 10 Marker2100 1A 35653
## 13 Marker1797 1A 51943
## 14 Marker3191 1A 51943
## 16 Marker1403 1A 56393
geno.gwas2<- data.frame(mark=colnames(geno.gwas),chr=map$chrom,loc=map$loc,t(geno.gwas))
dim(geno.gwas2)
## [1] 1759 331
colnames(geno.gwas2)[4:ncol(geno.gwas2)] <- rownames(geno.gwas)
head(pheno.gwas)
## GID ENVEnv1 ENVEnv2 ENVEnv3 ENVEnv4 Yield
## 77 Oat1 0 0 1 0 6126.514
## 81 Oat1 1 0 0 0 6613.542
## 330 Oat1 0 0 0 1 5984.449
## 417 Oat1 0 1 0 0 6855.848
## 441 Oat2 0 1 0 0 5510.549
## 442 Oat2 1 0 0 0 5711.477
geno.gwas2[1:6,1:6]
## mark chr loc Oat1 Oat2 Oat3
## Marker2297 Marker2297 1A 26595 -1 -1.0000000 -1
## Marker3125 Marker3125 1A 35232 1 -1.0000000 -1
## Marker2100 Marker2100 1A 35653 -1 1.0000000 1
## Marker1797 Marker1797 1A 51943 1 -0.4259651 -1
## Marker3191 Marker3191 1A 51943 -1 1.0000000 1
## Marker1403 Marker1403 1A 56393 -1 1.0000000 1
A statistically significant association between a genotypic marker and a particular trait is considered evidence of linkage between the phenotype and a causal locus. Population structure (PS) can lead to spurious associations between markers and a trait, so the statistical approach must account for PS.
gwasresults<-GWAS(pheno.gwas, geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=NULL, plot=T,n.PC=0)
## [1] "GWAS for trait: Yield"
## [1] "Variance components estimated. Testing markers."
gwasresults2<-GWAS(pheno.gwas, geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=NULL, plot=T,n.PC=6)
## [1] "GWAS for trait: Yield"
## [1] "Variance components estimated. Testing markers."
gwasresults3<-GWAS(pheno.gwas, geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=K.mat, plot=T,n.PC=0)
## [1] "GWAS for trait: Yield"
## [1] "Variance components estimated. Testing markers."
gwasresults4<-GWAS(pheno.gwas, geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=K.mat, plot=T,n.PC = 6)
## [1] "GWAS for trait: Yield"
## [1] "Variance components estimated. Testing markers."
#The option plot=T produces Manhattan plots and Q-Q plots; alternative plots
#can also be drawn directly in base R.
str(gwasresults)
## 'data.frame': 1759 obs. of 4 variables:
## $ mark : chr "Marker2297" "Marker3125" "Marker2100" "Marker1797" ...
## $ chr : chr "1A" "1A" "1A" "1A" ...
## $ loc : int 26595 35232 35653 51943 51943 56393 66313 68829 69872 70102 ...
## $ Yield: num 0.325 0.451 0.588 0.187 0.861 ...
#Let's look at the structure:
#the first three columns are just the marker and map information;
#the fourth and subsequent columns are the GWAS results. These values are
#already the -log10 p-values, so no further transformation is needed to plot them.
FDR<-function(pvals, FDR){
pvalss<-sort(pvals, decreasing=F)
m=length(pvalss)
cutoffs<-((1:m)/m)*FDR
logicvec<-pvalss<=cutoffs
postrue<-which(logicvec)
print(postrue)
k<-max(c(postrue,0))
cutoff<-(((0:m)/m)*FDR)[k+1]
return(cutoff)
}
alpha_bonferroni =-log10(0.05/length(gwasresults$Yield)) ### This is the Bonferroni correction
alpha_FDR_Yield <- -log10(FDR(10^(-gwasresults$Yield),0.05))## This is FDR cut off
## [1] 1 2 3 4 5 6 7
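As a cross-check (not part of the original analysis), base R's \(\texttt{p.adjust}\) implements the same Benjamini-Hochberg rule as the hand-rolled FDR function above; markers with an adjusted value below 0.05 are the FDR-controlled hits:

```r
# Toy p-values; markers 1 and 2 should survive a 5% FDR threshold.
toy_p <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.60)
bh <- p.adjust(toy_p, method = "BH")
which(bh < 0.05)   # markers surviving a 5% FDR
```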
Match with the hits: markers significantly associated with yield:
which(gwasresults$Yield>alpha_bonferroni)
## [1] 53 56 57 1054 1427
which(gwasresults$Yield>alpha_FDR_Yield)
## [1] 45 53 56 57 1054 1427 1428
which(gwasresults2$Yield>alpha_bonferroni)
## [1] 53 56 57 1054 1427 1428
which(gwasresults2$Yield>alpha_FDR_Yield)
## [1] 45 53 56 57 1054 1427 1428
which(gwasresults3$Yield>alpha_bonferroni)
## [1] 53 56 57 1054 1427
which(gwasresults3$Yield>alpha_FDR_Yield)
## [1] 45 53 56 57 1054 1427 1428
which(gwasresults4$Yield>alpha_bonferroni)
## [1] 53 56 57 1054 1427
which(gwasresults4$Yield>alpha_FDR_Yield)
## [1] 45 53 56 57 1054 1427 1428
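Beyond the plots produced by GWAS(..., plot=T), a Manhattan-style plot is easy to draw in base R; this sketch uses simulated \(-\log_{10}\) \(p\) scores rather than the real results above:

```r
# Simulated -log10(p) scores for 500 markers, with two artificial "hits".
set.seed(42)
scores <- -log10(runif(500))            # null scores
scores[c(120, 300)] <- c(6.5, 7.2)      # spiked hits
plot(scores, pch = 20, xlab = "Marker index", ylab = expression(-log[10](p)))
abline(h = -log10(0.05 / 500), col = "red", lty = 2)  # Bonferroni threshold
```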